Workshop: Creating and Transforming Variables

Author

Olga Mushakarara

Introduction

For this workshop, you’ll be cleaning and modifying a dataset with functions like drop(), rename(), to_numeric(), and replace(). These have been covered in the last few weeks of the prework.

Step 1: Load packages and data

To get started, load in the needed packages: pandas, numpy, and plotly.express.

import pandas as pd
import numpy as np
import plotly.express as px

Now, read in the dataset called physical_activity.csv from your data folder and save it as pa.

It should have 142 rows and 9 columns.

# reading the physical_activity csv
pa = pd.read_csv("Data/physical_activity.csv")
pa
personal_id household_id sex status bmi sedentary_ap_s_day light_ap_s_day mvpa_s_day oms_recommendation
0 1 2 M Adulte 32.2 41483 5782.571429 816.5714286 No
1 2 2 F Adulte 26.1 43091 4102.571429 486.7142857 No
2 3 2 F Enfant 16.7 38630.71429 1874.571429 2806 No
3 4 2 M Enfant 17.1 40290.28571 3196 4712 Yes
4 5 3 M Adulte missing 73121.57143 863.2857143 267.5714286 No
... ... ... ... ... ... ... ... ... ...
137 138 39 F Enfant 20.1 41610 3426.285714 1378.428571 No
138 139 39 F Enfant 19.8 35515.14286 4281.285714 2385.857143 No
139 140 40 M Adulte 37.1 38259.28571 4905.285714 1153.857143 Yes
140 141 40 F Adulte 37.6 44304.85714 5039.857143 664.2857143 No
141 142 40 F Enfant 35.6 45467.42857 3039.857143 1137.428571 No

142 rows × 9 columns

Step 2: Subset and rename variables

Next, drop the “personal_id” and “household_id” variables from the DataFrame.

# dropping the personal_id and household_id variables using drop: 

pa = pa.drop(columns = ["personal_id","household_id"])

pa
sex status bmi sedentary_ap_s_day light_ap_s_day mvpa_s_day oms_recommendation
0 M Adulte 32.2 41483 5782.571429 816.5714286 No
1 F Adulte 26.1 43091 4102.571429 486.7142857 No
2 F Enfant 16.7 38630.71429 1874.571429 2806 No
3 M Enfant 17.1 40290.28571 3196 4712 Yes
4 M Adulte missing 73121.57143 863.2857143 267.5714286 No
... ... ... ... ... ... ... ...
137 F Enfant 20.1 41610 3426.285714 1378.428571 No
138 F Enfant 19.8 35515.14286 4281.285714 2385.857143 No
139 M Adulte 37.1 38259.28571 4905.285714 1153.857143 Yes
140 F Adulte 37.6 44304.85714 5039.857143 664.2857143 No
141 F Enfant 35.6 45467.42857 3039.857143 1137.428571 No

142 rows × 7 columns

Now, rename oms_recommendation to who_recommendation. (OMS is the French name for WHO.)

# renaming the  `oms_recommendation` to `who_recommendation`. (OMS is the French name for WHO.)

pa = pa.rename(columns={"oms_recommendation": "who_recommendation"})

pa 
sex status bmi sedentary_ap_s_day light_ap_s_day mvpa_s_day who_recommendation
0 M Adulte 32.2 41483 5782.571429 816.5714286 No
1 F Adulte 26.1 43091 4102.571429 486.7142857 No
2 F Enfant 16.7 38630.71429 1874.571429 2806 No
3 M Enfant 17.1 40290.28571 3196 4712 Yes
4 M Adulte missing 73121.57143 863.2857143 267.5714286 No
... ... ... ... ... ... ... ...
137 F Enfant 20.1 41610 3426.285714 1378.428571 No
138 F Enfant 19.8 35515.14286 4281.285714 2385.857143 No
139 M Adulte 37.1 38259.28571 4905.285714 1153.857143 Yes
140 F Adulte 37.6 44304.85714 5039.857143 664.2857143 No
141 F Enfant 35.6 45467.42857 3039.857143 1137.428571 No

142 rows × 7 columns

Your data should now have 7 variables:

  • sex: male or female
  • status: adult (Adulte) or child (Enfant)
  • bmi: body mass index
  • sedentary_ap_s_day: sedentary behaviour (seconds/day)
  • light_ap_s_day: light physical activity (seconds/day)
  • mvpa_s_day: moderate to vigorous physical activity (seconds/day)
  • who_recommendation: whether or not WHO recommendations for physical activity are met

Step 3: Examine and convert data types

Let’s think about the types of variables in our dataset.

Check the data types that were automatically assigned to each variable with info().

# your code here
pa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   sex                 142 non-null    object
 1   status              142 non-null    object
 2   bmi                 142 non-null    object
 3   sedentary_ap_s_day  142 non-null    object
 4   light_ap_s_day      142 non-null    object
 5   mvpa_s_day          142 non-null    object
 6   who_recommendation  133 non-null    object
dtypes: object(7)
memory usage: 7.9+ KB

Notice that all of your variables are of type object (which is a catch-all for strings), but some should be numeric (e.g., bmi).

This is because the numeric variables have the words “missing” in them to indicate that the value is missing. Can you spot those words when you view the dataset?

We’ll use the pd.to_numeric() function to convert these variables to numeric.

Here is an example of how to use this.

test_df = pd.DataFrame({"bmi": ["32.2", "missing", "18.8", "17.8", "21.1"]})
test_df["bmi"] = pd.to_numeric(test_df["bmi"], errors="coerce")
test_df
bmi
0 32.2
1 NaN
2 18.8
3 17.8
4 21.1

The missing values are represented as NaN after the conversion.

Now, fill in the blanks below to apply this to your pa DataFrame.

# Change variables' type
pa["bmi"] = pd.to_numeric(pa["bmi"], errors="coerce")
pa["sedentary_ap_s_day"] = pd.to_numeric(pa["sedentary_ap_s_day"], errors="coerce")
pa["light_ap_s_day"] = pd.to_numeric(pa["light_ap_s_day"], errors="coerce")
pa["mvpa_s_day"] = pd.to_numeric(pa["mvpa_s_day"], errors="coerce")

Check the data types again with info().

Step 4: Recode the status variable

Next, notice that the status variable currently uses the French terms “Adulte” and “Enfant”.

# Run this code to check the values of status
pa["status"].value_counts()
status
Adulte    72
Enfant    70
Name: count, dtype: int64

Use the .replace() method to transform the status variable by replacing “Adulte” with “Adult” and “Enfant” with “Child”.

# changing the values Adulte -> Adult and Enfant -> Child 

pa["status"] = pa["status"].replace({"Adulte": "Adult", 
"Enfant": "Child"})

pa["status"].value_counts()
status
Adult    72
Child    70
Name: count, dtype: int64

Step 5: Convert the physical activity variables

There are 3 physical activity variables: sedentary behavior (sedentary_ap_s_day), light physical activity (light_ap_s_day), and moderate to vigorous physical activity (mvpa_s_day).

These variables are currently measured in seconds per day.

However, most recommendations for physical activity are in minutes per week, so we want to align with these measures.

To do this, complete the following manipulations:

  1. Use arithmetic operations to create new variables called sedentary_ap_m_wk, light_ap_m_wk, and mvpa_m_wk that are in minutes per week.

  2. Remove the previous seconds per day variables.

# Converting the secs/day to mins/ wk using :  (x/60)*7 

pa["sedentary_ap_m_wk"] = (pa["sedentary_ap_s_day"]/60) * 7 

pa["light_ap_m_wk"] = (pa["light_ap_s_day"]/60) * 7 

pa["mvpa_m_wk"] = (pa["mvpa_s_day"]/60) * 7 

pa
sex status bmi sedentary_ap_s_day light_ap_s_day mvpa_s_day who_recommendation sedentary_ap_m_wk light_ap_m_wk mvpa_m_wk
0 M Adult 32.2 41483.00000 5782.571429 816.571429 No 4839.683333 674.633333 95.266667
1 F Adult 26.1 43091.00000 4102.571429 486.714286 No 5027.283333 478.633333 56.783333
2 F Child 16.7 38630.71429 1874.571429 2806.000000 No 4506.916667 218.700000 327.366667
3 M Child 17.1 40290.28571 3196.000000 4712.000000 Yes 4700.533333 372.866667 549.733333
4 M Adult NaN 73121.57143 863.285714 267.571429 No 8530.850000 100.716667 31.216667
... ... ... ... ... ... ... ... ... ... ...
137 F Child 20.1 41610.00000 3426.285714 1378.428571 No 4854.500000 399.733333 160.816667
138 F Child 19.8 35515.14286 4281.285714 2385.857143 No 4143.433334 499.483333 278.350000
139 M Adult 37.1 38259.28571 4905.285714 1153.857143 Yes 4463.583333 572.283333 134.616667
140 F Adult 37.6 44304.85714 5039.857143 664.285714 No 5168.900000 587.983333 77.500000
141 F Child 35.6 45467.42857 3039.857143 1137.428571 No 5304.533333 354.650000 132.700000

142 rows × 10 columns

# Dropping the s_day variables: 

pa = pa.drop(columns=["sedentary_ap_s_day", "light_ap_s_day", "mvpa_s_day"])

pa
sex status bmi who_recommendation sedentary_ap_m_wk light_ap_m_wk mvpa_m_wk
0 M Adult 32.2 No 4839.683333 674.633333 95.266667
1 F Adult 26.1 No 5027.283333 478.633333 56.783333
2 F Child 16.7 No 4506.916667 218.700000 327.366667
3 M Child 17.1 Yes 4700.533333 372.866667 549.733333
4 M Adult NaN No 8530.850000 100.716667 31.216667
... ... ... ... ... ... ... ...
137 F Child 20.1 No 4854.500000 399.733333 160.816667
138 F Child 19.8 No 4143.433334 499.483333 278.350000
139 M Adult 37.1 Yes 4463.583333 572.283333 134.616667
140 F Adult 37.6 No 5168.900000 587.983333 77.500000
141 F Child 35.6 No 5304.533333 354.650000 132.700000

142 rows × 7 columns

Step 6: Total physical activity

Create a new column that adds light physical activity and moderate to vigorous physical activity, and call it total_pa_m_wk.

This should give us the total amount of activity in minutes per week.

# Adding "light_ap_m_wk" and mvpa_m_k 

pa["total_pa_m_wk"] = pa["light_ap_m_wk"] + pa["mvpa_m_wk"] 

pa
sex status bmi who_recommendation sedentary_ap_m_wk light_ap_m_wk mvpa_m_wk total_pa_m_wk
0 M Adult 32.2 No 4839.683333 674.633333 95.266667 769.900000
1 F Adult 26.1 No 5027.283333 478.633333 56.783333 535.416667
2 F Child 16.7 No 4506.916667 218.700000 327.366667 546.066667
3 M Child 17.1 Yes 4700.533333 372.866667 549.733333 922.600000
4 M Adult NaN No 8530.850000 100.716667 31.216667 131.933333
... ... ... ... ... ... ... ... ...
137 F Child 20.1 No 4854.500000 399.733333 160.816667 560.550000
138 F Child 19.8 No 4143.433334 499.483333 278.350000 777.833333
139 M Adult 37.1 Yes 4463.583333 572.283333 134.616667 706.900000
140 F Adult 37.6 No 5168.900000 587.983333 77.500000 665.483333
141 F Child 35.6 No 5304.533333 354.650000 132.700000 487.350000

142 rows × 8 columns

Step 7: Categorical variable for total physical activity

Write a custom function that bins its input variable into three categories:

  • low: less than 500 minutes per week (< 500)
  • medium: between 500 and 1000 minutes per week (>= 500 and < 1000)
  • high: greater than 1000 minutes per week (>= 1000)

Apply this function to the pa DataFrame to create a categorical variable called total_pa_cat

# defining a ftn to return low, medium and high: 

def activity_cat (total_phy_activity): 
   if total_phy_activity < 500: 
      return "low"
   elif total_phy_activity >= 500 and total_phy_activity < 1000: 
      return "medium"
   elif total_phy_activity >= 1000: 
      return "high"
   else: 
      return "ungraded"


activity_cat_vec = np.vectorize(activity_cat)
# creating a new variable total_pa_cat using the vectorized ftn : 

pa["total_pa_cat"] = activity_cat_vec(pa["total_pa_m_wk"])

pa 
sex status bmi who_recommendation sedentary_ap_m_wk light_ap_m_wk mvpa_m_wk total_pa_m_wk total_pa_cat
0 M Adult 32.2 No 4839.683333 674.633333 95.266667 769.900000 medium
1 F Adult 26.1 No 5027.283333 478.633333 56.783333 535.416667 medium
2 F Child 16.7 No 4506.916667 218.700000 327.366667 546.066667 medium
3 M Child 17.1 Yes 4700.533333 372.866667 549.733333 922.600000 medium
4 M Adult NaN No 8530.850000 100.716667 31.216667 131.933333 low
... ... ... ... ... ... ... ... ... ...
137 F Child 20.1 No 4854.500000 399.733333 160.816667 560.550000 medium
138 F Child 19.8 No 4143.433334 499.483333 278.350000 777.833333 medium
139 M Adult 37.1 Yes 4463.583333 572.283333 134.616667 706.900000 medium
140 F Adult 37.6 No 5168.900000 587.983333 77.500000 665.483333 medium
141 F Child 35.6 No 5304.533333 354.650000 132.700000 487.350000 low

142 rows × 9 columns

Step 8: Visualize data with plotly express

In this section you will use pa to generate some plots!

Plot 1: Histogram

Histograms are best used to visualize the distribution of a single numeric (continuous) variable. Choose a variable from the dataset that you can plot a histogram with.

Using plotly express, create and print a histogram.

# creating a histogram using the total activity per minute per week 

pa_plot = px.histogram(
   pa,
   x = "total_pa_m_wk", 
   labels= {"total_pa_m_wk": "Total Physical Activity per hr per week"}, 
   title = "Distibution of the total physical activity done by each respondent per hr per week")

pa_plot
  • Moderate physical activity (400-700 hours per week) is the most common.
  • Very high physical activity levels are rare but present.
  • The distribution suggests that extreme physical activity levels are uncommon.

Plot 2: Boxplot

Boxplots are best used to visualize the distribution of a numeric variable, split by the values of a discrete/categorical variable.

Use plotly express to create and print a boxplot.

Use the same variable used for your histogram on your y-axis, and plot it against a categorical variable on your x-axis.

Set fill color to match the values of your categorical variable.

# creating a boxplot disaggregated by respondent's status: 

status_plot = px.box(pa, 
y = "total_pa_m_wk", 
x = "status", 
color = "status", 
title = "Distribution of the total weekly physical activity by respondent status")

status_plot

Write a short explanation of what you see in the plot. - Adults generally have a higher median and more variable weekly physical activity compared to children. - Children’s activity levels are more consistent, but some outliers indicate highly active or inactive individuals. - Both groups have extreme cases of individuals engaging in very high weekly physical activity.the adults.

Optional Challenge Question: Health Risk Classification

  1. Write a function assess_risk() that:

    • Returns "High Risk" if BMI ≥ 30 and total physical activity < 500 min/week
    • Otherwise, returns "Not High Risk"
  2. Create a new column risk_category in your pa DataFrame by applying this function.

  3. Make a percentage stacked bar chart showing the count of "High Risk" and "Not High Risk" individuals, grouped by sex.

Comment on the findings. Which gender has a higher proportion of high risk individuals?

# Creating a ftn that assess rick based on BMI and total physical activity: 

def assess_risk(bmi_score, total_pa): 
   if bmi_score >= 30 and total_pa < 500: 
      return "High Risk"
   else: 
      return "Not High Risk"

assess_risk_vec = np.vectorize(assess_risk)
pa["risk_category"] = assess_risk_vec(pa["bmi"], pa["total_pa_m_wk"])

pa
sex status bmi who_recommendation sedentary_ap_m_wk light_ap_m_wk mvpa_m_wk total_pa_m_wk total_pa_cat risk_category
0 M Adult 32.2 No 4839.683333 674.633333 95.266667 769.900000 medium Not High Risk
1 F Adult 26.1 No 5027.283333 478.633333 56.783333 535.416667 medium Not High Risk
2 F Child 16.7 No 4506.916667 218.700000 327.366667 546.066667 medium Not High Risk
3 M Child 17.1 Yes 4700.533333 372.866667 549.733333 922.600000 medium Not High Risk
4 M Adult NaN No 8530.850000 100.716667 31.216667 131.933333 low Not High Risk
... ... ... ... ... ... ... ... ... ... ...
137 F Child 20.1 No 4854.500000 399.733333 160.816667 560.550000 medium Not High Risk
138 F Child 19.8 No 4143.433334 499.483333 278.350000 777.833333 medium Not High Risk
139 M Adult 37.1 Yes 4463.583333 572.283333 134.616667 706.900000 medium Not High Risk
140 F Adult 37.6 No 5168.900000 587.983333 77.500000 665.483333 medium Not High Risk
141 F Child 35.6 No 5304.533333 354.650000 132.700000 487.350000 low High Risk

142 rows × 10 columns

# stacked bar chart of the risk category by sex: 

risk_cat_plot = px.histogram(
   pa, 
   x = "risk_category", 
   color= "sex", 
   barnorm = "percent", 
   text_auto = ".1f"
)

risk_cat_plot
  • In the Not High Risk group, males and females are nearly equally represented.
  • In the High Risk group, females significantly outnumber males, making up nearly 73% of this category.
  • This may indicate that females are more likely to be classified as high risk compared to males in this dataset.